
BioData Mining

Springer Science and Business Media LLC

Preprints posted in the last 7 days, ranked by how well they match BioData Mining's content profile, based on 15 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Evaluating Large Language Models for Transparent Quality-of-Care Measurement in Children with ADHD

Bannett, Y.; Pillai, M.; Huang, T.; Luo, I.; Gunturkun, F.; Hernandez-Boussard, T.

2026-04-17 pediatrics 10.64898/2026.04.12.26350732 medRxiv
Top 0.2%
2.1%

Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement. Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review. Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations. Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence. Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework). Results: All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability.
LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit. Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.
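
As a quick consistency check on the figures in this abstract, the F1-score is the harmonic mean of sensitivity (recall) and PPV (precision). The sketch below recomputes it from the reported values; the function name `f1_from_sens_ppv` and the dictionary layout are illustrative and do not come from the preprint.

```python
def f1_from_sens_ppv(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV) pairs as reported in the abstract
reported = {
    "Claude-3.5": (0.89, 0.95),
    "LLaMA-3.3-70B": (0.91, 0.89),
    "GPT-4o": (0.82, 0.97),
}

# Round to two decimals, the precision used in the abstract
f1_scores = {
    name: round(f1_from_sens_ppv(sens, ppv), 2)
    for name, (sens, ppv) in reported.items()
}

for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.2f}")
```

The rounded values agree with the F1-scores quoted in the abstract (0.92, 0.90, and 0.89).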

2
An independent supervisory safety agent improves reaction of large language models to suicidal ideation

Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.

2026-04-15 psychiatry and clinical psychology 10.64898/2026.04.13.26350757 medRxiv
Top 0.4%
1.3%

Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
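
The paired counts in this abstract can be cross-checked with a few lines of arithmetic. For paired binary outcomes, the matched odds ratio is the ratio of the two discordant counts. The variable names below are illustrative; only the four counts come from the abstract.

```python
# Counts of paired evaluations as reported in the abstract
both_detected = 39
supervisor_only = 166  # discordant pairs favoring the supervisory system
llm_only = 2           # discordant pairs favoring the native LLM safeguards
neither = 17

total = both_detected + supervisor_only + llm_only + neither
supervisor_detected = both_detected + supervisor_only
llm_detected = both_detected + llm_only

# Matched (conditional) odds ratio: ratio of the discordant counts
matched_or = supervisor_only / llm_only

print(f"total evaluations: {total}")
print(f"supervisory system detected: {supervisor_detected} ({supervisor_detected / total:.1%})")
print(f"native safeguards detected: {llm_detected} ({llm_detected / total:.1%})")
print(f"matched odds ratio: {matched_or:.1f}")
```

The totals reproduce the reported 224 evaluations, 205 and 41 detections, and a matched odds ratio of 83.0.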

3
Fine-Tuning PubMedBERT for Hierarchical Condition Category Classification

Wang, X.; Hammarlund, N.; Prosperi, M.; Zhu, Y.; Revere, L.

2026-04-15 health systems and quality improvement 10.64898/2026.04.13.26350814 medRxiv
Top 0.5%
1.2%

Automating Hierarchical Condition Category (HCC) assignment directly from unstructured electronic health record (EHR) notes remains an important but understudied problem in clinical informatics. We present HCC-Coder, an end-to-end NLP system that maps narrative documentation to 115 Centers for Medicare & Medicaid Services (CMS) HCC codes in a multi-label setting. On the test dataset, HCC-Coder achieves a macro-F1 of 0.779 and a micro-F1 of 0.756, with a macro-sensitivity of 0.819 and macro-specificity of 0.998. By contrast, Generative Pre-trained Transformer (GPT)-4o achieves its highest scores, a macro-F1 of 0.735 and a micro-F1 of 0.708, under five-shot prompting. The fine-tuned model demonstrates consistent absolute improvements of 4%-5% in F1-scores over GPT-4o. To address severe label imbalance, we incorporate inverse-frequency weighting and per-label threshold calibration. These findings suggest that domain-adapted transformers provide more balanced and reliable performance than prompt-based large language models for hierarchical clinical coding and risk adjustment.

4
Characteristics of individuals with cerebral palsy across the United States

Aravamuthan, B. R.; Bailes, A. F.; Baird, M.; Bjornson, K.; Bowen, I.; Bowman, A.; Boyer, E.; Gelineau-Morel, R.; Glader, L.; Gross, P.; Hall, S.; Hurvitz, E.; Kruer, M. C.; Larrew, T.; Marupudi, N.; McPhee, P.; Nichols, S.; Noritz, G.; Oleszek, J.; Ramsey, J.; Raskin, J.; Riordan, H.; Rocque, B.; Shah, M.; Shore, B.; Shrader, M. W.; Spence, D.; Stevenson, C.; Thomas, S. P.; Trost, J.; Wisniewski, S.

2026-04-16 pediatrics 10.64898/2026.04.14.26350870 medRxiv
Top 1.0%
0.7%

Objective: Cerebral palsy (CP) affects approximately 1 million Americans and 18 million individuals worldwide, yet contemporary US epidemiologic data remains limited. We aimed to use the Cerebral Palsy Research Network (CPRN) clinical registry to describe demographics and clinical characteristics of individuals with CP across the US and determine associations with gross motor function and genetic etiology. Methods: Registry subjects were included if they had clinician-confirmed CP and prospectively entered data for Gross Motor Function Classification System (GMFCS) Level, gestational age, genetic etiology, CP distribution, and tone/movement types. Logistic regression was used to determine which of these variables plus race, sex, ethnicity, and age were associated with GMFCS level and genetic etiology. Results: A total of 9,756 children and adults with CP from 22 CPRN sites met inclusion criteria. Participants were predominantly White (73.0%), male (57.3%), non-Hispanic (87.8%), and younger than 18 years (73.7%). Most were classified as GMFCS levels I-III (55.6%), born preterm (52.8%), had spasticity (83.8%), and had quadriplegia (41.9%); 12.2% were identified as having a genetic etiology. Tone/movement types, CP distribution, and gestational age were significantly associated with both GMFCS level and genetic etiology (p<0.001). Compared to White individuals, Black individuals were more likely to have greater gross motor impairment (p<0.001). Conclusion: In this large US cohort, clinical and demographic factors, including race, were associated with gross motor function and genetic etiology in CP. These findings highlight persistent disparities and demonstrate the value of a national clinical registry for informing prognostication, quality improvement efforts, and targeted genetic testing strategies.

5
An Exploratory Study on the Long-Term Impact of Voiding Cystourethrogram (VCUG)

McDonald, A.; Sullivan, K.

2026-04-17 pediatrics 10.64898/2026.04.15.26350983 medRxiv
Top 2%
0.4%

OBJECTIVE This study investigates the long-term impacts of childhood exposure to voiding cystourethrogram (VCUG), a diagnostic procedure for vesicoureteral reflux. Primary outcomes include long-term health outcomes, mental health disorders, healthcare avoidance, and participation in risky behaviors compared to a control group. METHODS A 9-month retrospective cohort study was conducted with adults who received most of their medical care in the U.S. Respondents self-reported health metrics, behaviors, and outcomes via a 20-minute survey. Respondents were divided into two groups: those who remembered undergoing at least one VCUG in childhood (VCUG group), and those who did not (control group). RESULTS Of 334 respondents, 204 (61%) were in the VCUG group (mean age: 29, 70% female) and 130 (39%) were controls (mean age: 34, 70% female). Notable findings include: 47% of VCUG respondents were diagnosed with depression compared to 27% of controls. 15% of female-born VCUG respondents reported they would never visit a gynecologist compared to 2% of controls. 34% of VCUG respondents smoked regularly compared to 5% of controls, and 11% of VCUG respondents regularly missed work compared to 1% of controls. These findings highlight the need for further research and clinical consideration of VCUG's long-term consequences. CONCLUSIONS This study suggests that the effects of childhood VCUG extend into adulthood. Our findings underscore the need to reassess informed consent protocols and consider full-scale studies to minimize bias.

6
Performance of open-source large language models on nephrology self-assessment program

Ahangaran, M.; Jia, S.; Chitalia, S.; Athavale, A.; Francis, J. M.; O'Donnell, M. W.; Bavi, S. R.; Gupta, U. D.; Kolachalama, V. B.

2026-04-16 nephrology 10.64898/2026.04.16.26348910 medRxiv
Top 2%
0.4%

Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain. Methods: We evaluated the performance of five open-source large language models (LLMs): PodGPT, which is a podcast-pretrained model focused on STEMM disciplines, Llama 3.2-11B, Mistral-7B-Instruct-v0.2, Falcon3-10B-Instruct, and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response. Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).
Conclusions: This study highlights the importance of STEMM-based training to enhance the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.

7
SIEVE: Locus-Anchored Drug Prioritization for Complex Disorders

Strobl, E. V.

2026-04-17 pharmacology and therapeutics 10.64898/2026.04.15.26350958 medRxiv
Top 2%
0.3%

Motivation: Complex disorders arise from multiple genetic mechanisms, but most drug-prioritization methods treat each disorder as a single phenotype and therefore miss locus-specific therapeutic opportunities. Results: We present SIEVE, a framework that decomposes complex disorders into genetically localized subphenotypes and links GWAS summary statistics, reference expression, and perturbational transcriptional profiles to prioritize compounds that target locus-anchored disease mechanisms. SIEVE also constructs genetically calibrated mechanism vectors, projects away nonspecific expression programs using negative anchors, and aggregates evidence across cell lines, doses, and time points to produce robust drug rankings. Across simulations and analyses of real data, SIEVE improves compound prioritization relative to existing methods and shows that subphenotype-aware, genetics-guided modeling can sharpen therapeutic discovery in heterogeneous disorders. Availability and Implementation: R implementation: github.com/ericstrobl/SIEVE.

8
Risk factors, outcomes, and predictors of therapeutic response in preterm infants with patent ductus arteriosus: A retrospective cohort study

Hamida, H. B.; El Ouaer, M.; Abdelmoula, S.; El Ghali, M.; Bizid, M.; Chamtouri, I.; Monastiri, K.

2026-04-17 pediatrics 10.64898/2026.04.10.26350668 medRxiv
Top 2%
0.3%

Background: Patent ductus arteriosus (PDA) is a common and potentially serious cardiovascular condition in preterm infants, particularly those with low gestational age and birth weight. Its management remains controversial due to variability in screening, diagnostic criteria, and treatment strategies. This study aimed to evaluate risk factors, outcomes, and management strategies for PDA in preterm infants, and to identify predictors of clinical and echocardiographic response to therapy. Methods: We conducted a retrospective cohort study over a 4-year period (2016-2019) in the neonatal intensive care unit (NICU) of a tertiary care center. All consecutive preterm infants admitted during the study period were eligible. Infants with echocardiographically confirmed PDA who received pharmacological treatment with intravenous paracetamol or ibuprofen were included in the analysis. Missing data were minimal and handled using available-case analysis. Statistical analyses included descriptive statistics, Pearson's chi-square test, and multivariable logistic regression. Results: Among 2154 preterm infants admitted to the NICU, 60 were diagnosed with PDA (incidence: 2.8%). The mean gestational age was 29 ± 2.6 weeks, and the median birth weight was 1200 g. Respiratory distress occurred in 95% of cases, mainly due to hyaline membrane disease (86.7%). PDA was symptomatic in 80% of infants. First-line treatment resulted in clinical improvement in 77% and ductal closure in 83.3% of cases, most within 3 days. Predictors of successful closure included gestational age ≥ 28 weeks (OR = 5.9; 95% CI: 1.7-20.2) and antenatal corticosteroid exposure (OR = 1.2; 95% CI: 1.0-1.6). Overall mortality was 35% and was significantly higher in infants < 28 weeks (OR = 5.0; 95% CI: 2.4-10.3). Clinical improvement (OR = 3.7) and echocardiographic closure (OR = 4.5) after first-line treatment were associated with reduced mortality.
Conclusions: PDA in preterm infants is associated with substantial morbidity and mortality, particularly in those born before 28 weeks of gestation. Early diagnosis, antenatal corticosteroid exposure, and timely pharmacological treatment may improve outcomes. Systematic echocardiographic screening in high-risk neonates should be considered.

9
Time to Discharge and Associated Factors Among Preterm Neonates Admitted to Kiwoko Hospital, Nakaseke District, Uganda: A Competing Risks Analysis

Mutibwa, S.; Wandiembe, S.; Mbonye, K.; Nsimbe, D.

2026-04-15 pediatrics 10.64898/2026.04.13.26350793 medRxiv
Top 3%
0.2%

Background: Preterm births contribute to approximately 35% of neonatal deaths globally, with an estimated 13.4 million infants born prematurely each year. Despite this substantial burden, limited evidence exists on time to discharge and its determinants among preterm neonates admitted to Neonatal Intensive Care Units (NICUs), particularly in rural Ugandan settings. This study aimed to investigate time to discharge and associated factors among preterm neonates admitted to Kiwoko Hospital in Nakaseke District, Uganda. Methods: A retrospective cohort study was conducted using secondary data from Kiwoko Hospital on preterm neonates admitted to the Neonatal Intensive Care Unit (NICU) between 2020 and 2021 (n = 847). The cumulative incidence function was used to estimate the probability of discharge within 28 days of admission, accounting for competing events. A Fine and Gray sub-distribution hazard regression model was fitted to identify factors associated with time to discharge. Results: Of the 847 preterm admissions, 70.1% were discharged alive within 28 days. The median time to discharge was 14 days. The cumulative incidence of discharge by 28 days was 68%, accounting for competing events. During follow-up, 165 neonates did not complete the 28-day period, including 88 deaths. Factors significantly associated with time to discharge included place of delivery (SHR: 0.62; 95% CI: 0.53-0.73; p<0.001), maternal residence in other districts (SHR: 0.69; 95% CI: 0.48-0.99; p=0.044), extreme preterm (SHR: 0.05; 95% CI: 0.03-0.09; p<0.001), very preterm (SHR: 0.18; 95% CI: 0.14-0.25; p<0.001), moderate preterm (SHR: 0.59; 95% CI: 0.46-0.76; p<0.001), triplet births (SHR: 0.40; 95% CI: 0.23-0.68; p=0.001), 2-4 ANC visits (SHR: 0.70; 95% CI: 0.56-0.87; p=0.002), <=1 ANC visit (SHR: 0.64; 95% CI: 0.49-0.85; p=0.002), respiratory distress syndrome (SHR: 0.64; 95% CI: 0.48-0.74; p<0.001), and birth trauma (SHR: 2.62; 95% CI: 1.60-4.29; p<0.001). 
Conclusions: Respiratory distress syndrome, fewer antenatal care visits, out-of-district residence, and higher degrees of prematurity were associated with prolonged time to discharge among preterm neonates. Strengthening antenatal care utilization and improving access to quality neonatal care in underserved areas may enhance discharge outcomes.

10
Triage Administration of Ondansetron for Gastroenteritis in children; a randomized controlled trial

Weill, O.; Lucas, N.; Bailey, B.; Marquis, C.; Gravel, J.

2026-04-15 pediatrics 10.64898/2026.04.13.26350796 medRxiv
Top 3%
0.2%

Objectives: Acute gastroenteritis is a leading cause of pediatric emergency department (ED) visits. While ondansetron reduces vomiting, intravenous rehydration, and hospital admissions, its efficacy when initiated at triage remains unclear. We aimed to evaluate whether triage nurse-initiated administration of ondansetron in children with suspected gastroenteritis reduces the proportion of patients requiring observation following initial physician assessment. Methods: We conducted a randomized, double-blind, placebo-controlled trial in a tertiary pediatric ED in Canada. Children aged 6 months to 17 years presenting with more than 3 episodes of vomiting in the preceding 24 hours (including 1 within 2 hours of arrival) were eligible. At triage, we randomized participants to receive liquid ondansetron or a color- and taste-matched placebo. The primary outcome was the proportion of patients requiring observation after the first physician evaluation. Secondary outcomes included post-intervention vomiting, ED length of stay, patient comfort, and 48-hour return visits. The trial was registered at ClinicalTrials.gov (NCT03052361). Results: Recruitment was stopped prematurely due to the COVID-19 pandemic. Ninety-one participants were randomized to ondansetron (n = 44) or placebo (n = 47). Overall, 40 patients (45%) were discharged immediately after the initial physician assessment, with no difference between the ondansetron and placebo groups (44% vs. 45%; absolute difference -1%, 95% CI: -20% to 19%). No significant differences were observed in any secondary outcome. Conclusion: In this trial, triage nurse-initiated ondansetron administration did not reduce the need for ED observation in children with presumed gastroenteritis. Although underpowered, this study could inform researchers planning larger clinical trials.

11
Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv
Top 4%
0.2%

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important diagnostic tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows multimodal large language models (MLLMs) to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-score (p_balanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.
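
The difference between the majority-voting and soft-voting strategies named in this abstract can be illustrated with a toy sketch. It assumes each agent emits a probability vector over the five likelihood categories; the abstract does not say how agent outputs are represented, and the function names and numbers below are made up for illustration.

```python
from collections import Counter


def majority_vote(agent_probs):
    """Each agent votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in agent_probs]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(agent_probs):
    """Average class probabilities across agents, then pick the top class."""
    n_agents = len(agent_probs)
    n_classes = len(agent_probs[0])
    mean_probs = [sum(p[c] for p in agent_probs) / n_agents for c in range(n_classes)]
    return max(range(n_classes), key=mean_probs.__getitem__)


# Toy example (hypothetical numbers): two agents narrowly prefer class 4,
# one agent strongly prefers class 3.
toy_probs = [
    [0.0, 0.0, 0.0, 0.45, 0.55],
    [0.0, 0.0, 0.0, 0.45, 0.55],
    [0.0, 0.0, 0.0, 0.95, 0.05],
]

majority_class = majority_vote(toy_probs)
soft_class = soft_vote(toy_probs)
print(f"majority vote: class {majority_class}; soft vote: class {soft_class}")
```

In this toy case majority voting returns class 4, while soft voting returns class 3, because averaging lets one confident agent outweigh two marginal votes.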

12
Assessing Swedish Genetic Counselling Outcome Measures for Autism and General Use: Rasch Findings Highlight the Need for Improved Measures

Nordstrand, M.; Fajutrao Falk, S.; Johansson, M.; Pestoff, R.; Tammimies, K.

2026-04-15 genetic and genomic medicine 10.64898/2026.04.13.26350766 medRxiv
Top 4%
0.2%

Genetic counselling outcome measures are increasingly adapted for diverse clinical contexts. While the Genetic Counselling Outcome Scale (GCOS-24) is available in Swedish, no autism-specific version has been developed. Therefore, we adapted the Swedish GCOS-24 using the English version of the modified GCOS-24 (mGCOS-24) to create a Swedish autism-specific mGCOS-24. Thereafter, we evaluated both the Swedish autism mGCOS-24 and the Swedish general GCOS-24 using Rasch analysis to assess their psychometric properties. Both instruments exhibited structural challenges, including multidimensionality, disordered thresholds, local item dependence, and invariance issues. For the Swedish autism mGCOS-24, we were able to identify subscales with acceptable measurement properties. However, applying the same structure to the Swedish general GCOS-24 did not resolve its broader limitations. This study introduces the first Swedish autism-specific mGCOS-24 and represents the first Rasch-based evaluation of any GCOS-24 or mGCOS-24 in Swedish. Our findings highlight important opportunities for measure refinement but also indicate that new or more substantially adapted tools may be needed to capture outcomes of genetic counselling in autistic populations.

13
Deriving LD-adjusted GWAS summary statistics through linkage disequilibrium deconvolution

Nouira, A.; Favre Moiron, M.; Tournaire, M.; Verbanck, M.

2026-04-11 genetic and genomic medicine 10.64898/2026.04.10.26350574 medRxiv
Top 4%
0.2%

Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits. However, linkage disequilibrium (LD) confounds these associations, leading to false positives where non-causal variants appear associated because they are correlated with nearby causal variants. This is particularly the case in highly polygenic traits where the genome can be saturated in causal variants. To address this issue, we propose LDeconv, a method based on truncated singular value decomposition (SVD) that adjusts GWAS summary statistics without requiring individual-level genotype data. This approach accounts for LD structure, isolates causal variants in high-LD regions, and improves the reliability of effect size estimates. We assess its performance through simulations across various LD scenarios, conduct extensive sensitivity analyses, and apply it to real GWAS data from the UK Biobank. Our results demonstrate that LDeconv effectively reduces false discoveries while preserving true associations, offering a robust framework for post-GWAS analysis.

14
Dynamic and Baseline Multi-Task Learning for Predicting Substance Use Initiation in the ABCD Study

Wei, M.; Zhang, H.; Peng, Q.

2026-04-13 addiction medicine 10.64898/2026.04.10.26350655 medRxiv
Top 4%
0.1%

Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study (release 5.1), we developed two complementary multi-task learning (MTL) frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR).
Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
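
The step of combining interval-level initiation risk into subject-level cumulative risk can be sketched with the standard discrete-time survival identity, cumulative risk = 1 − ∏(1 − h_t). This is an assumption about the combination rule, which the abstract does not spell out; the function name and the hazard values are hypothetical.

```python
import math


def cumulative_risk(interval_hazards):
    """Probability of initiating by the end of the last interval.

    Assumes the standard discrete-time identity: the subject stays
    event-free only by 'surviving' every interval in turn.
    """
    survival = math.prod(1.0 - h for h in interval_hazards)
    return 1.0 - survival


# Hypothetical per-interval initiation probabilities for one participant
toy_hazards = [0.01, 0.02, 0.05, 0.10]
toy_risk = cumulative_risk(toy_hazards)
print(f"cumulative initiation risk: {toy_risk:.3f}")
```

Under this identity the cumulative risk never decreases as intervals are added, which is why longitudinal interval predictions can be rolled up into a single fixed-horizon estimate per participant.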

15
Ad-verse Effects: Pharmaceutical Advertising Shifts Drug Recommendations by Consumer-Facing AI

Omar, M.; Agbareia, R.; McGreevy, J.; Zebrowski, A.; Ramaswamy, A.; Gorin, M.; Anato, E. M.; Glicksberg, B. S.; Sakhuja, A.; Charney, A.; Klang, E.; Nadkarni, G.

2026-04-16 health policy 10.64898/2026.04.14.26350868 medRxiv
Top 4%
0.1%

Large language models are increasingly used for clinical guidance while their parent companies introduce advertising. We tested whether pharmaceutical ads embedded in the prompts of 12 models from OpenAI, Anthropic, and Google shift drug recommendations across 258,660 API calls and four experiments probing distinct epistemic conditions. When two drugs were both guideline-appropriate, advertising shifted selection of the advertised drug by +12.7 percentage points (P < 0.001), with some model-scenario pairs shifting from 0% to 100%. Google models were the most susceptible (+29.8 pp), followed by OpenAI (+10.9 pp), while Anthropic models showed minimal change (+2.0 pp). When the advertised product lacked evidence or was clinically suboptimal, models resisted. This reveals a structured vulnerability: advertising does not override medical knowledge but fills the space where clinical evidence is underdetermined. An open-response subanalysis (2,340 calls across three representative models) confirmed that advertising restructures free-text clinical reasoning: models echoed ad claims at 2.7 times the baseline rate while maintaining high stated confidence and rarely disclosing the ad. Susceptibility was provider-dependent (Google: +29.8 pp; OpenAI: +10.9 pp; Anthropic: +2.0 pp). Because this bias operates within clinically correct answers, it is invisible to accuracy-based evaluation, identifying a class of AI safety vulnerability that standard testing cannot detect.

16
Cross-cultural adaptation and psychometric validation of the ISBAR Structured Handover Observation Tool in ICU-to-ward patient transfer

Ni, N.; Zhao, B.; Wang, Y.; Wang, Q.; Ding, J.; Liu, T.

2026-04-14 nursing 10.64898/2026.04.10.26350669 medRxiv
Top 4%
0.1%

The ISBAR framework is used to standardize clinical handovers and enhance patient safety. Observational tools based on ISBAR have been developed to assess the completeness of information transfer. However, these instruments have primarily been developed in non-Chinese contexts, and validated Chinese-language observational tools suitable for clinical practice remain limited. In this study, a cross-cultural adaptation and psychometric validation of the ISBAR Structured Handover Observation Tool was conducted, examining its reliability and discriminant validity in Chinese clinical settings. The study was conducted in two phases: cross-cultural adaptation and psychometric evaluation in real-world clinical settings. Content validity was assessed using the Content Validity Index (CVI), and inter-rater reliability was evaluated using the Intraclass Correlation Coefficient (ICC) based on a two-way mixed-effects model with absolute agreement. Discriminant validity was examined using the Mann-Whitney U test to compare scores across nurses with varying levels of clinical experience. A total of 233 handover cases involving patient transfers from the intensive care unit (ICU) to general wards were collected, involving 84 nurses. The scale demonstrated good content validity, with item-level CVIs ranging from 0.88 to 1.00 and a scale-level CVI/Ave of 0.98. The inter-rater reliability, assessed using fifty randomly selected cases, was high, with an ICC of 0.885 for single-rater assessments and 0.939 for average-rater assessments. Discriminant validity analysis showed that nurses with more clinical experience had significantly higher total scores than those with less experience (Z = -4.772, p < 0.001). The Chinese version of the ISBAR Structured Handover Observation Tool demonstrates good content validity, high inter-rater reliability, and acceptable discriminant validity. This tool provides a standardized and practical method for assessing the completeness of information transfer and is expected to support quality improvement in patient handover from the ICU to general wards in Chinese clinical settings.
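
For readers unfamiliar with the content validity indices reported in this abstract, the sketch below shows the conventional calculation: the item-level CVI is the share of experts who rate an item as relevant (3 or 4 on a 4-point scale), and the scale-level CVI/Ave is the mean of the item-level values. The ratings, the panel size of eight, and the function names are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
from statistics import mean

def item_cvi(ratings, relevant_threshold=3):
    """Item-level CVI: share of experts rating the item as relevant
    (>= relevant_threshold on a 4-point relevance scale)."""
    return sum(r >= relevant_threshold for r in ratings) / len(ratings)

def scale_cvi_ave(ratings_by_item):
    """Scale-level CVI/Ave: mean of the item-level CVIs."""
    return mean(item_cvi(r) for r in ratings_by_item)

# Hypothetical ratings from 8 experts for 3 items (illustrative only).
ratings_by_item = [
    [4, 4, 3, 4, 4, 3, 4, 4],  # all 8 experts rate the item relevant
    [4, 3, 4, 4, 2, 4, 3, 4],  # 7 of 8 experts rate the item relevant
    [3, 4, 4, 4, 4, 4, 4, 3],  # all 8 experts rate the item relevant
]

i_cvis = [item_cvi(r) for r in ratings_by_item]
s_cvi_ave = scale_cvi_ave(ratings_by_item)
print(i_cvis, round(s_cvi_ave, 3))
```

With a panel of this size, a single dissenting expert lowers an item's CVI to 0.875, which is why item-level values tend to fall on a small set of discrete steps.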

17
Patient-Centred Communication in Lung Cancer Screening: A Clinically Focussed Evaluation of a Fine-Tuned Open-Source Model Against a Larger Frontier System

Khanna, S.; Chaudhary, R.; Narula, N.; Lee, R.

2026-04-11 oncology 10.64898/2026.04.10.26350595 medRxiv
Top 4%
0.1%

Lung cancer screening saves lives, yet uptake remains suboptimal and inequitable. Personalised communication can improve attendance and reduce anxiety, but scaling such support is a workforce challenge. We fine-tuned Google's Gemma 2 9B using QLoRA on 5,086 synthetic screening conversations and compared it against Google's Gemini 2.5 Flash (a larger frontier model) and an unmodified baseline across 300 multi-turn conversations with 100 patient personas spanning ten clinical categories. Evaluation combined automated natural language processing metrics with independent language model judgement in two complementary modes: structured clinical rubric and simulated patient persona. The fine-tuned model achieved the highest simulated patient experience score (3.71/5 vs 3.65 for the frontier model), recorded zero boundary violations after clinician review of all flagged instances, and led on the four most safety-critical categories. A composite Patient Adaptation Index showed that the fine-tuned model led overall (0.37 vs 0.35 vs 0.35), with its clearest advantage on the two clinically specific components: empathy calibration to patient distress and selective smoking cessation signposting. These findings suggest that targeted fine-tuning of open-source models can yield clinical communication quality comparable to larger proprietary systems, with advantages in safety-critical scenarios and suitability for NHS data governance constraints. Human clinician review of these conversations is ongoing.

18
Medicalbench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics 10.64898/2026.04.12.26350704 medRxiv
Top 4%
0.1%

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

19
Leveraging State-of-the-Art LLMs for the De-identification of Sensitive Health Information in Clinical Speech

Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.

2026-04-17 health informatics 10.64898/2026.04.13.26349911 medRxiv
Top 4%
0.1%

Accurate recognition and de-identification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance. However, the recognition and de-identification process itself risks exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which focuses on two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, the results reveal that LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.
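
The F1-scores quoted for the fine-tuned models can be related to the precision and recall ranges through the harmonic-mean definition of F1. The sketch below is a consistency check, not the authors' evaluation code; it assumes that the lower precision endpoint pairs with the lower recall endpoint (and likewise for the upper endpoints), which the abstract does not state.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Range endpoints as quoted in the abstract.
# Pairing low-with-low and high-with-high is an assumption made here.
f1_low = f1_score(0.885, 0.830)
f1_high = f1_score(0.889, 0.847)

print(round(f1_low, 3), round(f1_high, 3))
```

Under that pairing, the computed values round to the reported F1 range of 0.857-0.867.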

20
Adherence to International Pharmacogenomic Recommendations in Paediatric Cancer Care: A Cohort Analysis Embedded Within the MARVEL-PIC Randomised Trial

Chawla, A.; Carter, S.; Dyas, R.; Williams, E.; Moore, C.; Conyers, R.

2026-04-16 genetic and genomic medicine 10.64898/2026.04.15.26348678 medRxiv
Top 4%
0.1%

Background: Pharmacogenomic testing (PGx) can optimise drug efficacy and minimise toxicity, but the extent of prescriber adherence to PGx recommendations remains unclear. We aimed to quantify clinician adherence to international genotype-guided prescribing recommendations in a cohort of paediatric oncology patients. Methods: We reviewed files of children enrolled in the MARVEL-PIC (NCT05667766) randomised controlled trial, who had PGx recommendations available. Patients were included if 12 weeks had passed since their PGx report was released to clinicians. Prescribing events were identified for actionable PGx recommendations, and classified as "explicitly followed", "inadvertently followed", or "not followed". Adherence was assessed by patient, drug, and recommendation. Results: 2,063 PGx recommendations were available for 216 patients. 64 (3.1%) recommendations were actionable for 44 patients and 10 drugs within the 12-week study period. Recommendations were explicitly followed in 57/288 (19.8%) of prescribing events, inadvertently followed in 145 (50.3%), and not followed in 86 (29.9%). Mercaptopurine demonstrated the highest rate of explicit adherence (87.5%). No significant associations were observed between adherence and age group, cancer type, drug type, or strength of recommendation. Conclusion: Adherence to pharmacogenomic recommendations was very low, highlighting the need to understand barriers to PGx implementation, and consideration of clinical decision supports to facilitate adherence.
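
The percentages in the Results can be reproduced from the counts given in the abstract. The short sketch below does that arithmetic; the variable names are illustrative, and the only inputs are the counts quoted above.

```python
# Prescribing-event counts as quoted in the abstract.
events = {
    "explicitly followed": 57,
    "inadvertently followed": 145,
    "not followed": 86,
}

total_events = sum(events.values())  # denominator for the adherence shares
shares = {label: round(100 * n / total_events, 1) for label, n in events.items()}

# Share of all available recommendations that were actionable in the study window.
actionable_share = round(100 * 64 / 2063, 1)

print(total_events, shares, actionable_share)
```

The three categories sum to the 288 prescribing events used as the denominator, so the explicit, inadvertent, and non-adherent shares together account for every event.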